Abstract

Background: Ongoing advancements in cloud computing provide novel opportunities in scientific computing, 
especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for 
querying, processing, and visualizing genomics' "Big Data" from sources like The Cancer Genome Atlas (TCGA) and the 
International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of 
QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of 
Linked Data in Biomedicine. 

Results: QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and 
retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus 
pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer 
machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to 
execute them without installing any extra plugins or programs. A client library provides high-level distribution 
templates including MapReduce. This stark departure from the current reliance on expensive server hardware running 
"download and install" software has already gathered substantial community interest, as QM received more than 2.2 
million API calls from 87 countries in 12 months.

Conclusions: QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and 
data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to 
enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments. 

Keywords: Cloud computing, Crowdsourcing, Distributed computing, JavaScript, MapReduce, PaaS, Sequence 
analysis, Web service 



Availability 

A supporting QM deployment is available at
https://v1.qmachine.org, and its source code is available at
https://github.com/wilkinson/qmachine. The illustrative
examples and their dependencies are provided for live
demonstration at http://q.cgr.googlecode.com/hg/index.html
along with a screencast and archived genomic data.

Background 

High-performance computing (HPC) for the life sciences 
is undergoing a fundamental reshaping [1]. The reliance 
on processor-intensive resources through which ever-enlarging
genomics workflows are funneled is giving way
to distributed data-intensive infrastructures like TCGA 
and ICGC [2]. Accordingly, the immovable volumes that 
are flooding data centers demand "beyond the data del- 
uge" solutions [3] that invert the traditional transfer model 
so that computations travel to the data and not vice versa. 
The emphasis, then, is to maximize the availability of the 
data and the portability of the application. The increas- 
ing use of cloud computing infrastructure for biomedical 
applications reflects the realignment of HPC, as exemplified
by the recent partnership between the National Institutes
of Health (NIH) and Amazon on the 1000 Genomes
Project [4]. 

© 2014 Wilkinson and Almeida; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public
Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this
article, unless otherwise stated.

Wilkinson and Almeida BMC Bioinformatics 2014, 15:176
http://www.biomedcentral.com/1471-2105/15/176

At the same time, HPC projects such as SETI@home
[5], Folding@Home [6], and BOINC [7] have constructed
distributed platforms that aggregate commodity hardware
and volunteer compute cycles in order to power compu- 
tationally intensive scientific workflows. In fact, the Fold- 
ing@Home project currently utilizes the central and/or 
graphics processing units from more than 250,000 per- 
sonal computers and video game consoles [8]. To orches- 
trate concentrated efforts across such large numbers of 
physical machines and hardware platforms, researchers 
provide client applications that they must persuade vol- 
unteers to download and install permanently on their 
machines. These applications range in invasiveness from 
programs that run only when a machine is idle, such as the 
pioneering SETI@home, to always-on background ser- 
vices like Condor [9] that may tangibly impact a machine's 
performance. 

The World Wide Web provides a different avenue for
HPC, a novel direction that we explore with QM.
The temptation to optimize QM for a particular
problem domain was overcome by the greater challenge 
of creating a system not only to distribute computation 
across the Web, but also to be "of the Web" itself. A careful 
study of the Web as a platform reveals that the necessary 
components are indeed ready for assembly. 

The JavaScript (JS) language is not only a "real language"
[10] but also a "Lisp in C's clothing" [11] with
support for functional and object-oriented programming 
styles. Unlike Lisp, however, JS is widely used outside of 
academia and has ranked among the top twelve most pop- 
ular languages for more than thirteen years [12]. Scientific 
libraries in JS are relatively scarce, although a number of 
specialized libraries such as EBI's BioJS [13], NIH/NHGRI 
JBrowse [14], and the recent Genome Maps [15] have 
emerged to capitalize on the widespread availability of 
those computational resources, particularly in the genome 
browsing application domain. 

Web browsers execute JS in sandboxed environments 
that rigorously control access to machine resources, and 
now those sandboxes implement standardized APIs that 
provide native capabilities like hardware-accelerated 3D 
graphics. All modern browsers and even a few browser 
plugins include just-in-time (JIT) compilers to boost performance
[16]. Regular expressions in JS, for example,
perform at levels that are no longer matched by Perl 
[17], the language most often associated with string pro- 
cessing in bioinformatics applications. Moreover, these 
high-performance JS environments come pre-installed 
on every personal computer sold today, as well as on 
smartphones, tablets, gaming consoles, and even tele- 
visions. Thus, web browsers represent a modern route 
for high-performance computing that is well-suited for 
the "crowdsourcing" model [18]. Indeed, the current
fast proliferation of bioinformatics libraries in JS also
reflects the advent of web-based "social coding" environments
which present entirely novel opportunities for
large-scale collaboration [19]. Furthermore, the networking
capabilities of the browser platform allow it to import
code and data dynamically and thereby orchestrate dis- 
tributed workflows across multiple browsers on distinct 
machines, a feature at the core of social computing 
[20]. Therefore, what is described in this report could 
be construed as social computing for machines [21], 
extending the reach of loose distribution models such as 
mGrid [22]. 

The emergence of Big Data in the biomedical sciences 
has been associated with the proliferation of reference 
databases such as those reviewed yearly by Nucleic Acids 
Research [23]. The aggregation of the Web of Linked 
Data resources independent of the institutions that host 
them has been approached by comprehensive data mod- 
els such as the Distributed Annotation System [24], which 
we have also explored as a backbone for workflow assembly
[25]. It is now amply clear, however, that the linking of
data resources, regardless of the domain, is itself domain-neutral
and best described by dyadic predicates of W3C's
Resource Description Framework (RDF) that underlies 
the third generation of Web technologies [21,26,27]. 

The RDF approach has expanded the basic reliance on
uniform resource identifiers (URIs) both to identify and
locate data (via URLs) which require only a web browser
to be put to good use by any researcher, regardless of
their expertise or domain of interest. The current extent
of its use is dramatically illustrated by the adoption of 
the RDF framework across all data services of the Euro- 
pean Bioinformatics Institute [28]. As also illustrated by 
some of our work [29-31] developing SPARQL endpoints 
for TCGA, the volume of the server-side hosted data 
is not a significant obstacle to developing web appli- 
cations ("web apps") that consume those data. On the 
other hand, mechanisms to assemble workflows for data 
analysis have not yet matured as user-friendly commodi- 
ties, despite the availability of excellent frameworks like 
Taverna [32] and SHARE [33]. One possibility is that the 
underlying web services themselves need to be amenable 
to assembly at a moment's notice - even for deprecated 
or outdated versions of a procedure. This is an absolute 
requirement of the modern focus on reproducibility of 
workflow results [34]. We have explored the use of modu- 
lar browser-based web apps to deliver this functionality in 
standard bioinformatics applications such as image anal- 
ysis [35] and sequence analysis [36]. The success in those 
two efforts strengthens the claim that script tag loading, 
the same mechanism web browsers use to load web apps, 
can orchestrate and distribute the execution of bioinfor- 
matics workflows across multiple physical machines. The 
illustrative, and validating, example detailed in the Results 
section below will extend the same example of sequence 
analysis approached in the second of those two reports 
by analyzing twenty different genomes of Streptococcus 
pneumoniae in parallel. 

Methods

QM provides a distributed computing platform as a web 
service (PaaS). Its architecture (see Figure 1 in Results) 
combines the general pattern of Web 3.0 technologies 
with the model used by modern social networking sites 
by decoupling the presentation/analytical layer from the 
persistent representation layer so that the former runs on 
the client side as a web app that consumes an applica- 
tion programming interface (API) presented by the latter 
on the server side. QM also decouples the presentation 
and analytical layers of the web app so that third par- 
ties may embed the QM web service as part of their own 
web apps. 

To provide this functionality, QM contains three main 
components: an API server, a web server, and a web- 
site. The API and web servers are written completely in 
JS, and the website is written in HTML5, CSS, and JS. 
Nothing about QM's design or interface binds it to a particular
development stack, but our desire to construct the
project as a true Web Computing "device" motivated us to 
implement as much of the code in JS as possible. The strat- 
egy paid unexpected dividends, as well; the server-side 
components are free from assumptions about the hard- 
ware and operating systems on which they run, which 
vastly simplifies deployment to the cloud via Platform-as- 
a-Service (PaaS) [37]. 

This report also presents code examples (see Results) 
which can be run from any website that embeds the QM
web service. The examples are all written in JS, but some of
them also make use of CoffeeScript, "a little language that 
compiles into JavaScript" [38]. Many common scientific 
languages can be translated to and from JS, and a com- 
prehensive list of projects for this purpose is available at 
http://bit.ly/altjsorg.

API server 

The API server is a program which responds to requests 
sent by clients over the standard Hypertext Transfer Pro- 
tocol (HTTP). The program then interprets the requests 
according to their methods, target URLs, and embedded
data. QM's API presents three operations, as shown
in Table 1. Data sent as part of a POST should be for- 
matted in JavaScript Object Notation (JSON) format; 
response data from QM are JSON-formatted, too. Note 
that clients need not be browsers - any software pack- 
age that can communicate over HTTP and manipulate 
JSON-formatted data can use QM directly. 
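As a rough illustration of how a client might address the three operations in Table 1, the following sketch builds request descriptors in JS without sending them. The base URL, the helper name buildRequest, the operation labels, and the Content-Type header are assumptions made for illustration; they are not taken from QM's documented interface, and the address of an actual deployment should be substituted.

```javascript
// Hypothetical helper: describes the HTTP requests from Table 1 without sending them.
// BASE is an assumed address; replace it with that of a real deployment.
var BASE = "https://api.qmachine.org";

function buildRequest(op, box, arg, body) {
    var path = BASE + "/box/" + encodeURIComponent(box);
    switch (op) {
    case "get-task":   // GET /box/<box>?key=<key>; a 200 response carries a JSON object
        return {method: "GET", url: path + "?key=" + encodeURIComponent(arg)};
    case "list-tasks": // GET /box/<box>?status=<status>; a 200 response carries a JSON array
        return {method: "GET", url: path + "?status=" + encodeURIComponent(arg)};
    case "put-task":   // POST /box/<box>?key=<key> with a JSON object; responds 201
        return {
            method: "POST",
            url: path + "?key=" + encodeURIComponent(arg),
            headers: {"Content-Type": "application/json"}, // assumed header
            body: JSON.stringify(body)
        };
    default:
        throw new Error("Unknown operation: " + op);
    }
}

var req = buildRequest("list-tasks", "sean", "waiting");
console.log(req.method, req.url);
// prints: GET https://api.qmachine.org/box/sean?status=waiting
```

Because the descriptors are plain objects, they can be handed to any HTTP client, which mirrors the point that clients need not be browsers.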

Figure 1 The abstract architecture. Architecture of a self-assembled QMachine highlighting the distribution of processing (rectangles) and
transfer bandwidth (arrows). QM distributes not only the compute cycles needed to execute the n different procedures ($E_{1,2,\ldots,n}$), but also the
bandwidth needed to retrieve the corresponding input data ($D_{1,2,\ldots,n}$) being processed from their respective URLs ($d_{1,2,\ldots,n}$). The assembly is started
by a submitter who issues a key that is endorsed by a number of volunteering web browser sessions ($V_{1,2,\ldots,n}$). The code is then transferred to a queue
on QM where it is picked up by the volunteering browsers.

Table 1 HTTP API

          HTTP request                                  HTTP response
Method    Example URL                    Data           Code    Data
GET       /box/sean?key=some_job_id                     200     {}
GET       /box/sean?status=waiting                      200     []
POST      /box/sean?key=some_job_id      {}             201

This table specifies the Application Programming Interface (API) understood by
QM's API server. Request and response data use JavaScript Object Notation
(JSON) format, but data may be omitted where values are left blank. The "{}"
denotes a JSON object, and the "[]" denotes a JSON array.

The API server is implemented as a simple Node.js
[39] program that loads and executes all of its application
logic from QM's own publicly available module,
"qm", using the Node Package Manager (NPM) [40]. The
module supports five different databases as targets for
persistent storage: Apache CouchDB [41], MongoDB [42],
PostgreSQL [43], Redis [44], and SQLite [45]. These five
open-source databases were chosen for support based
on their high performance and popularity, and their
differences in design help to guide the development of
QM as an HPC solution for a heterogeneous database 
landscape. The relative merits of the alternative imple- 
mentations to the default use of MongoDB are as fol- 
lows. CouchDB and MongoDB are both document-centric 
NoSQL databases with MapReduce APIs that understand 
JS, but their designs are very different. CouchDB is more 
than just a database - it is nearly sufficient to imple- 
ment QM by itself because it bundles a web server and an 
HTTP-accessible API. MongoDB, by way of contrast, has 
an API that mimics the traditional relational style used by 
PostgreSQL and SQLite, with a much stronger focus on 
clustering and "sharding" (horizontal partitioning) across 
nodes. PostgreSQL represents relational database man- 
agement systems (RDBMS), the workhorses that tradi- 
tionally power enterprise applications and data ware- 
houses, while SQLite represents embedded (serverless) 
databases. Redis is an in-memory key-value store that is 
often referred to as a "data structure server" because its 
keys can contain strings, hashes, lists, sets, and sorted sets. 
The ability to map QM's persistent representation layer
across such a wide variety of storage systems simplifies 
deployment and maintenance significantly. The service 
that backs this report's illustrative examples, available at 
https://v1.qmachine.org, uses MongoDB.

QM's API server supports Cross-Origin Resource Shar- 
ing (CORS) [46] so that any webpage can embed QM to 
distribute workflows across web browsers without violating
the Same-Origin Policy [47]. There is wide support for
CORS in web browsers [48]. 
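To make the role of CORS concrete, a minimal sketch of how a Node.js request handler could attach permissive CORS headers is given below. The helper name addCorsHeaders and the specific header values are assumptions for illustration only; QM's actual server code may set different headers.

```javascript
// Hypothetical helper; the header values are illustrative, not QM's actual configuration.
function addCorsHeaders(res) {
    // Allow pages served from any origin to read the response.
    res.setHeader("Access-Control-Allow-Origin", "*");
    // Advertise the two HTTP methods that appear in Table 1.
    res.setHeader("Access-Control-Allow-Methods", "GET, POST");
    // Permit a JSON Content-Type header on cross-origin POST requests.
    res.setHeader("Access-Control-Allow-Headers", "Content-Type");
    return res;
}

// Usage with Node's built-in http module (not started here):
// require("http").createServer(function (req, res) {
//     addCorsHeaders(res);
//     res.end("{}");
// });
```

With headers of this kind in place, a page hosted on a third-party domain can call the API directly from the browser.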

Web server 

The web server, like the API server, is implemented as a 
Node.js program, and its logic is contained in the same 
NPM module, "qm". That is, the installation of all of 
QMachine's base libraries can be achieved simply by 
running Node's built-in module management system: 
npm install qm. It is worth recalling the minimal role 
played by the server-side components of QM (see Figure 1 
in Results). The web server exists only to provide the pre- 
sentation/analytical layer's resources to client machines. 
Because these resources are static, the web server can be 



replaced by off-the-shelf web servers like Apache [49] and 
Nginx [50]. 

Website 

The website functions as the presentation/analytical layer 
of QM. It was developed as a browser client for the 
QM API, and thus it is implemented in HTML5, CSS, 
and JS, as can be verified by inspecting its source code 
at https://github.com/wilkinson/qmachine. The website 
consists of a single webpage that communicates with the 
API server periodically via XMLHttpRequest (XHR) using 
a technique known as Asynchronous JavaScript and XML 
(AJAX). The "AJAX" name is a bit misleading, however, 
because XHR is not limited to handling XML data; all of 
the browser client's communications with the API server 
use JavaScript Object Notation (JSON). 

When a browser loads the webpage, it initially loads 
only the presentation layer, comprised of the HTML, CSS, 
and JS resources necessary to render the graphical user 
interface (GUI). Immediately after the GUI loads, the 
browser retrieves QM's analytical layer, which is written 
entirely in JS. This design improves the user experience 
by loading the GUI faster, and it isolates the presenta- 
tion layer's code from the analytical layer's code. Thus, 
third parties can embed QM's analytical layer and thereby 
use QM's persistent representation layer without load- 
ing QM's presentation layer, as shown by the examples at 
https://v1.qmachine.org/barebones.html and
http://q.cgr.googlecode.com/hg/index.html.

QM's browser client models a workflow as a set of trans- 
forms that should be applied to input data in a specific 
order to produce output data. A "task description" is an
object that contains the transform f, the data x, and any
information needed to prepare the environment prior to
execution.
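The idea of a task description can be sketched as follows, assuming that the transform is shipped as source text inside a JSON document and rebuilt on the receiving side. The field names (box, status, f, x, env) and the helpers describeTask and runTask are hypothetical and may not match QM's wire format.

```javascript
// Hypothetical task description; field names are illustrative only.
function describeTask(x, f, box, env) {
    return {
        box: box,
        status: "waiting",
        f: f.toString(), // functions are not valid JSON, so ship the source text
        x: x,
        env: env || {}
    };
}

// What a volunteer might do with a received description (no env handling here).
function runTask(json) {
    var task = JSON.parse(json);
    // Rebuild the function from its source. This only works for self-contained
    // functions that do not refer to variables from an enclosing scope.
    var f = (new Function("return (" + task.f + ");"))();
    return f(task.x);
}

var json = JSON.stringify(describeTask(2, function (x) { return x + 2; }, "demo"));
console.log(runTask(json)); // prints 4
```

The round trip through JSON is what limits distributable tasks to functions whose behavior is fully captured by their own source text.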

As described above, the client-side application that
is distributed when a browser is pointed to
https://v1.qmachine.org was developed using only web technologies:
HTML5, JS, and CSS. In order to stay within the core 
JS syntax that is supported by all browsers and all plat- 
forms - including mobile devices - code development was 
assisted by JSLint [51]. JSLint is also used directly within 
the analytical layer as a static analysis tool to identify tasks 
which can be serialized faithfully into JSON for distri- 
bution to volunteer machines. A generic library, Quanah 
[52], was also developed to solve the numerous con- 
currency challenges faced in asynchronous data transfer 
by QM; it is therefore a key component of the prototype
described here and is accordingly also made publicly
available as open source. The presentation layer uses
jQuery [53] and Twitter Bootstrap [54] to ensure a consistent
look-and-feel across a variety of mobile and desktop
browsers. The GUI additionally attempts to support
outdated browsers through optional integration with
Google Chrome Frame [55], HTML5 Shiv [56], and
json2.js [57], but it does so only as a courtesy.

Demonstration program 

The demonstration program is written in pure JavaScript 
so that it can be run in an ordinary web browser with- 
out dependencies on any native applications, plugins, or 
add-ons. It extends a demonstration from a previous study 
[36] in which a MapReduce decomposition of a sequence 
analysis procedure was used to find the longest similar 
segment between a user-given sequence and a full bac- 
terial genome. The demonstration in this study will not 
only reproduce the previous results using remote execu- 
tion on another machine, but it will do so in parallel for all 
of the twenty strains of Streptococcus pneumoniae that are 
currently available from the National Center for Biotech- 
nology Information (NCBI). It uses an updated version 
of the same Universal Sequence Maps (USM) [58] library 
used by the previous study, as referenced directly by exact 
version from its online repository. 

Results 

The architecture of a QMachine detailed in Figure 1 fol- 
lows the general pattern of Web 3.0 technologies by using 
the server side exclusively for persistent representation 
and leaving the rest of the program logic to run on the 
client side. QM uses a key-value architecture to orches- 
trate volunteering client machines in a manner that max- 
imizes the distribution of the computational resources 
required for data transfer and subsequent data processing.
This orchestration is highlighted in Figure 1: QM
distributes not only the compute cycles needed to execute
the n different procedures ($E_{1,2,\ldots,n}$), but also the
bandwidth needed to retrieve the corresponding input
data ($D_{1,2,\ldots,n}$) being processed from their respective URLs
($d_{1,2,\ldots,n}$). This design is motivated by the constraints of
biological applications such as next generation sequenc- 
ing in which the limiting factor is more often the available 
memory than the processor speed. 

The operation of QM relies on the creation of unique 
identifiers to define "boxes" that are then shared with the 
volunteering browsers in a manner resembling traditional 
API keys. This operation will be described in a series of 
four examples that increase in complexity, beginning with 
(1) the remote execution of a simple algebraic operation, 
followed by (2) distribution of the same operation as a 
parallel (map) transformation of the elements of an array 
and (3) distribution again as part of a MapReduce pro- 
cedure; finally, the (4) parallel execution of a real-world 
genomic sequence analysis in which both the code and 
the data needed to perform the analysis are invoked by a 
single submitter but then entirely resolved and executed 
asynchronously by multiple volunteer browsers. The final, 
real-world example distributes both the processing and
networking loads, as described in Figure 2. It illustrates
the ability of volunteer nodes to call code and data from 
multiple sources which are independently developed and 
maintained. This illustrative series is also available as a 
YouTube webcast at http://goo.gl/tnpMiQ. 

Loading the client-side library 

QM's analytical layer is provided by a JS library that can
be loaded by any web browser automatically as part of any
webpage that contains the following code:

<script src="https://v1.qmachine.org/q.js"></script>

Once loaded, the JS environment will contain a global 
object named QM with convenient high-level methods that 
can be used to reproduce the results of the four examples 
that follow. 

(1) Simple algebraic operation

For the first illustrative example, let f be a function that
increments a given number x by 2, and let x = 2. To compute
the result, f(x), on a volunteer machine, we could use
the QM.submit method:

QM.submit(2, function (x) {
    return x + 2;
});

As in the rest of the illustrative series, this example is 
described and demonstrated in the accompanying screen- 
cast (http://goo.gl/tnpMiQ). Note also that this simple 
operation is easily expressed in other languages such as
CoffeeScript [38]:

QM.submit(2, "(x) -> x + 2");

As discussed in "Methods", QM's architecture does not 
impose the use of a specific programming language, as 
long as a compiler to JS, the web's "assembler language" 
[35], is distributed with the remote call. To support this 
claim, the QM client library delegates to a compiler - 
written in JS - for the CoffeeScript language. For a list 
of compilers that can translate programs written in other 
languages into JS so that they can be interpreted by volun- 
teering browsers, see http://bit.ly/altjsorg. 

(2) Simple distributed map 

Because each QM.submit operation is an asynchronous
call, multiple calls can run simultaneously. Thus, it is 
straightforward to distribute the execution of a "map" 
function, a higher-order functional pattern that applies 
the same operation to each element of an array. This 
pattern is so ubiquitous in scientific computing that it warrants
a dedicated method, QM.map, that can be used as
follows:

QM.map([1, 2, 3, 4, 5], "(x) -> x + 2");
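The semantics of a distributed map can be sketched with an in-memory stand-in for a queue, in which one task is created per array element and results are reassembled by key regardless of completion order. The function localMap and its task fields are hypothetical; no network traffic is involved, and this is not QM's implementation.

```javascript
// In-memory illustration only; task fields (key, x, y, status) are assumed names.
function localMap(xs, f) {
    // One task per array element, as a distributed map would submit them.
    var queue = xs.map(function (x, i) {
        return {key: i, x: x, status: "waiting"};
    });
    // Simulate volunteers finishing in a different order than submission.
    queue.slice().reverse().forEach(function (task) {
        task.y = f(task.x);
        task.status = "done";
    });
    // The submitter reassembles results by key, so completion order is irrelevant.
    var ys = [];
    queue.forEach(function (task) {
        ys[task.key] = task.y;
    });
    return ys;
}

console.log(localMap([1, 2, 3, 4, 5], function (x) { return x + 2; }));
// prints [ 3, 4, 5, 6, 7 ]
```

Because each element is processed independently, the tasks can be executed by different volunteers without any coordination between them.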

Figure 2 A workflow for real-world genomic analysis. (1) The submitter interactively calls the high-level QM.map function from a web browser
with the URLs of twenty different Streptococcus pneumoniae genomes as input, causing the client to submit twenty individual task descriptions to 
QM's API server. (2) A volunteer's browser, polling for new task descriptions on QM's API server, finds and downloads a task description. (3) The 
volunteer's browser executes the task after downloading three external resources: the USM library served from GitHub, the JMat library served from 
Google Code, and the bacterial genome served from NCBI. (4) The volunteer's browser returns the results of the task execution to QM's API server 
and resumes polling for new task descriptions. (5) The submitter's browser, polling for updates to the individual task descriptions, retrieves the 
results from QM's API server. 



(3) Simple distributed MapReduce 

Just as in the "map" function shown above, it is straightfor- 
ward to distribute the execution of a "reduce" function, a 
higher-order functional pattern which combines elements 
of an array two at a time until only one value remains. 
As recently surveyed by Zou et al. [59], the MapReduce 
programming template is at the very core of modern com- 
putationally intensive bioinformatics applications. This 
third illustration demonstrates the MapReduce pattern as 
an extension of the second example by subsequently summing
the results of the distributed "map" using a "reduce"
that is also distributed across QM's volunteers:

QM.mapreduce([1, 2, 3, 4, 5],
    "(x) -> x + 2", "(a, b) -> a + b");
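The expected result of this example can be checked locally with a small sketch that applies the map and then combines values two at a time until one remains. The function localMapReduce is hypothetical and runs entirely in one process; the pairwise rounds are an assumption about how such a reduce could be parallelized, and they require the combining function to be associative.

```javascript
// Local illustration of MapReduce semantics; not QM's implementation.
function localMapReduce(xs, f, g) {
    var ys = xs.map(f);
    // Combine neighbours pairwise. The pairs within one round are independent
    // of one another, which is what would allow them to run on separate volunteers.
    while (ys.length > 1) {
        var next = [];
        for (var i = 0; i < ys.length; i += 2) {
            next.push(i + 1 < ys.length ? g(ys[i], ys[i + 1]) : ys[i]);
        }
        ys = next;
    }
    return ys[0];
}

var total = localMapReduce(
    [1, 2, 3, 4, 5],
    function (x) { return x + 2; },
    function (a, b) { return a + b; }
);
console.log(total); // prints 25, since 3 + 4 + 5 + 6 + 7 = 25
```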

(4) Real-world genomic analysis 

The fourth illustrative example assesses QM's ability
to scale the asynchronous operations demonstrated
above for use in a real-world bioinformatics workflow.
The example is a Fractal MapReduce decomposition of
sequence alignment [36] which distributes both the processing
and networking loads across QM's volunteers, as
described in Figure 1. It also demonstrates that libraries
of any complexity or elaboration can be distributed to the
volunteers along with the commands that invoke those 
libraries. Specifically, both the data and the library encod- 
ing for the sequence analysis procedure are invoked by 
QM but entirely resolved and executed by the volunteer 
browsers. It also illustrates the ability of a volunteer node 
to call code and data from multiple sources which are 
independently developed and maintained. 

Consider, as in the first example, that we have some x
that we wish to transform via some function f, where x is
now an array of URLs that reference FASTA files hosted
by NCBI:

x = [
    "http://ftp.ncbi.nlm.nih.gov/genomes/" +
        "Bacteria/Streptococcus_pneumoniae_" +
        "670_6B_uid52533/NC_014498.fna",
    /*
     * (eighteen URLs omitted)
     */
    "http://ftp.ncbi.nlm.nih.gov/genomes/" +
        "Bacteria/Streptococcus_pneumoniae_" +
        "gamPNI0373_uid175861/NC_018630.fna"
];

We want to perform a particular sequence analysis on
each FASTA file, namely a Fractal MapReduce decomposition
of the Chaos Game Representation [36]. Thus, we
define a function f for use with the QM.map method that
will take a URL as input and return the results of the
sequence analysis as output:

f = function (url) {
    var seq = "TCCACAGCATGCGTGACGATGACACG";
    return new usm(url, "ACGT").alignQ(seq);
};

There is a key challenge, however, in that our function f 
depends on a usm function that exists only after an exter- 
nal library has been loaded. Therefore, to specify the task 
completely, we will need either to include usm as part of 
f or else to pass a reference to the library in the form of 
a URL. We chose the latter strategy in this case so that 
the library can be downloaded in parallel by each vol- 
unteer simultaneously without burdening the API server. 
Each external function may have multiple dependencies, 
and thus QM.map accepts an optional env parameter so
that the dependencies for each external function can be
specified as an array of URLs to be loaded sequentially:

env = {
    usm: [
        "http://goo.gl/RSQ5i",
        "http://git.io/tAeG_w"
    ]
};
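The sequential loading of dependencies can be sketched with a small callback-based helper. The names loadSequentially and loadOne are hypothetical; loadOne is supplied by the caller, and in a browser it could append a script element and invoke its callback once the script has loaded. This is a sketch of the general technique and not QM's own loader.

```javascript
// Hypothetical loader: loads URLs one after another, stopping at the first error.
function loadSequentially(urls, loadOne, done) {
    var i = 0;
    (function next(err) {
        if (err || i >= urls.length) {
            return done(err || null);
        }
        // Start the next download only after the previous one has called back.
        loadOne(urls[i++], next);
    }());
}

// Usage with a stand-in for real script loading:
var demoEnv = {usm: ["http://goo.gl/RSQ5i", "http://git.io/tAeG_w"]};
loadSequentially(demoEnv.usm, function (url, callback) {
    console.log("would load " + url);
    callback();
}, function (err) {
    console.log(err ? "failed" : "all dependencies loaded");
});
```

Loading in order matters when a later library depends on definitions introduced by an earlier one.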

Finally, we will specify the box parameter explicitly for 
demonstration purposes. The box parameter takes the 
place of an API key and allows volunteers to execute tasks 
in a particular queue. This mechanism allows submitters 
to direct tasks into different queues and further enables 
the use of abstractions like MapReduce: 

box = "fasta-demo";

Putting these definitions together, we now launch twenty
individual genomic sequence analyses for simultaneous
execution via

QM.map(x, f, box, env);

A full version of these examples can be found online at 
http://q.cgr.googlecode.com/hg/index.html. The version 
there includes the full URLs to all twenty Streptococcus 
pneumoniae genomes and also to the versioned libraries 
specified by env. An accompanying screencast for these 
examples is also provided in that page. 

Usage statistics 

The dissemination of browser-based tools in social cod- 
ing environments like GitHub [19] is characterized by the 
same expansive dynamics as social media. For example,
although this is our first report describing it, QM can
be - and has been - discovered by the community at large.
During the 12-month period beginning in April 2013, QM
received more than 2.2 million API calls from 2,100 IP 
addresses in 87 countries to more than 1,800 QM "boxes" 
(the code and results exchange domains defined by token), 
with 98 boxes receiving more than 1,000 calls each and 
16 boxes receiving 10,000 calls or more. The statistics 
of QM usage are described in Figure 3, and the geo- 
graphic distribution of its users is described in Figure 4. 
It is unclear exactly how much of QM's usage is associated
with the distributed computational genomics web
apps that motivated its development, but the wide geo- 
graphic distribution of its users suggests an appeal driven 
by a more general interest in distributed computing. This 
interpretation is reinforced by unsolicited reports about 
QM in HPC media such as HPCwire (article at
http://goo.gl/9H5W03) and insideHPC (http://goo.gl/bDkJZL).
Finally, as noted in Methods, all of the server- and client-
side software is open-source and permissively licensed.
The browser client requires nothing more than script tag 
loading to be included in a web app, and the server is just 
as accessible through NPM [40]. It is therefore conceivable 
that other QM deployments are in use at other addresses, 
perhaps even within the firewall of medical centers, as was 
the specific intention of QM's development.

The server load associated with orchestrating this ini- 
tial heavy use of QM is very modest because of the 
reliance on code distribution rather than on code exe- 
cution. In fact, the deployment supporting the usage 
statistics described above (the server behind
https://api.qmachine.org) was never overwhelmed by traffic spikes
even though it was running on a shared-tenant virtual 
machine with just 512 MB of RAM, 2 x 512 MB MongoDB 
databases, and no hard drive. Furthermore, the authors do 
not incur any maintenance costs for the public tool dis- 
semination, either from GitHub or from NPM's package 
repository. We are therefore committed not to collect any 
data beyond the broad statistics described in Figures 3 
and 4 for the reference deployment discussed here. Par- 
ticularly relevant for the biomedical usage scenario that 
motivated this work, we are also committed not to collect 
any data at all from private deployments of QM; in other 
words, no part of QM's software ever sends data back to
our server(s) from other deployments. This design allows 
administrators to deploy their own QM servers through 
NPM and fully configure their own security as needed for 
clinical and/or biomedical research usage. These assur- 
ances can, of course, be verified through inspection of 
QM's source code.

Discussion 

QMachine is a web service for executing distributed 
workflows that can use ordinary web browsers as the 

Figure 3 Three representations of usage data. This plot illustrates worldwide usage of the QMachine web service from log data collected from 
April 2013 to April 2014. More than 2.2 million calls were made to its Application Programming Interface (API), the details for which are shown in
Table 1. The thin, solid curve represents the number of calls made to its API server by hour; QM was idle (no
calls received) during approximately 79% of all one-hour periods, and those hours were omitted. The thin, dashed line represents the number of API 
calls aggregated by IP address. The thick, dashed line represents the number of API calls made to a particular "box" on QM; see the Results section 
for further details about QM's boxes. 




Figure 4 Geographical distribution of API calls. This illustrates the usage of the QMachine web service by country as a world map. More than 2.2
million API calls were received from visitors in 87 countries, as identified by IP address. Each country's shade of green varies from pale to dark in 
relation to its rank as sorted from greatest to least. 
ephemeral compute nodes of a crowdsourced supercomputer.
The idea here is simple: commodity computers
equipped with web browsers join an abstract machine by 
visiting a website, and they unjoin by navigating to a dif- 
ferent site or by closing the browser. While a browser 
remains on the site, it reacts to input from the user and 
from the site's backend infrastructure by executing JS,
which provides the abstract machine with some potential 
to perform computations. At any instant, the net compu- 
tational potential available to a high-traffic website falls 
well within the HPC range, as shown in Figure 5. QM 
enables this potential to be harnessed with no nominal 
cost through volunteer computing. 
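
The join, work, and leave cycle described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `Box` class and its `submit`, `volunteer`, and `retrieve` methods are hypothetical names, not QM's actual API, and a real deployment would exchange serialized code and results over HTTP rather than call functions in the same process.

```javascript
// Hypothetical in-memory stand-in for a task/result exchange.
// Names and structure are assumptions for illustration, not QM's API.
class Box {
  constructor() {
    this.tasks = [];          // tasks waiting for a volunteer
    this.results = new Map(); // task id -> computed result
    this.nextId = 0;
  }

  // A submitter posts a task: a function plus its argument.
  submit(fn, arg) {
    const id = this.nextId++;
    this.tasks.push({ id, fn, arg });
    return id;
  }

  // A volunteer takes one waiting task, runs it, and posts the result.
  // Returns false when there is nothing left to do.
  volunteer() {
    const task = this.tasks.shift();
    if (task === undefined) return false;
    this.results.set(task.id, task.fn(task.arg));
    return true;
  }

  // The submitter retrieves a result (undefined until a volunteer has run it).
  retrieve(id) {
    return this.results.get(id);
  }
}

const box = new Box();
const id = box.submit((x) => x * x, 12);

// A volunteer keeps working for as long as it "stays on the site".
while (box.volunteer()) { /* keep polling for tasks */ }

console.log(box.retrieve(id)); // 144
```

The submitter never runs the task itself; it only posts work and reads results, which is why the coordinating server can stay small.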



MapReduce

Many researchers with access to large-scale computational
resources still find those resources inaccessible
because "everyday" workflows often require more than
just fast computers - they require programming skills that
are harder to acquire. Bioinformatics workflows increasingly
rely on MapReduce as an abstraction, but available
MapReduce resources still expose researchers to programming
environments with strict procedural requirements
and steep learning curves. QM is much simpler to
set up and operate than Apache Hadoop [61], for example.
It allows users to run MapReduce jobs on multiple
physical machines and to crowdsource elastic computing
resources with the simplicity of writing and loading a webpage
- skills performed every day by millions of people
worldwide. We argue that using the web computing architecture
explored by QM - that is, without installing a
dedicated application - is a natural evolution of current
cloud-based MapReduce services, just as Hadoop was a
step up from one-off compile-and-run workflows.
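
As a rough illustration of the MapReduce pattern discussed in this section, the sketch below runs a map step and a reduce step in plain JavaScript. The `mapReduce` helper, the toy DNA fragments, and the GC-counting task are all assumptions made for this example; they are not QM's client library or the paper's sequence analysis.

```javascript
// Generic MapReduce sketch. `mapReduce` is a hypothetical helper,
// not a function from QM's client library.
function mapReduce(inputs, mapper, reducer, initial) {
  // In a distributed setting, each mapper call could run on a different
  // volunteer machine; here the calls run locally, one after another.
  const mapped = inputs.map((input) => mapper(input));
  return mapped.reduce(reducer, initial);
}

// Toy input: short DNA fragments (made-up data for illustration).
const fragments = ['ACGT', 'GGCA', 'TTAC'];

// Map step: count the G and C bases in one fragment.
const countGC = (seq) =>
  [...seq].filter((base) => base === 'G' || base === 'C').length;

// Reduce step: add up the per-fragment counts.
const add = (a, b) => a + b;

const totalGC = mapReduce(fragments, countGC, add, 0);
console.log(totalGC); // 6
```

Because the mapper calls do not depend on one another, they are the part of the job that can be handed out to volunteers; only the reduce step needs all of the partial results together.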



Distributed computing 

QM's web service provides a message passing interface for
distributed computing. This statement may at first sound
paradoxical, but JS's single-threaded programming model
does not limit JS programs to single-threaded execution;
external execution contexts can be used to support concurrency
via event-driven programming. QM leverages
Figure 5 Performance distribution of the "Top500" supercomputers. This histogram shows the floating-point performance distribution of the
"Top500" fastest supercomputers, given in terms of representative commodity laptops. The data used were taken from the list published in
November 2013 at http://top500.org and compared to the results obtained by running the same LINPACK benchmark [60] on a commodity laptop
(Core i7-2720QM, 8 GB 1333 MHz DDR3 SDRAM). Simple division of the real-world performance yielded the estimated performance as multiples of
the commodity laptop.
browsers' asynchronous (non-blocking) network communications
layers to connect multiple machines' execution
contexts, but browsers that support Web Workers [62] can 
execute concurrent programs within the same physical 
machine. 
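
The point about event-driven concurrency can be made concrete with a short sketch. Here timers stand in for network round trips to other machines; the `remoteTask` and `runAll` names are hypothetical, and nothing below is taken from QM's code.

```javascript
// Simulates a task running in an external execution context.
// The timer is a stand-in for a non-blocking network round trip.
function remoteTask(name, delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`${name} done`), delayMs);
  });
}

async function runAll() {
  // All three tasks are started before any of them finishes, so their
  // waiting periods overlap even though this script has a single thread.
  const pending = [
    remoteTask('a', 30),
    remoteTask('b', 10),
    remoteTask('c', 20),
  ];
  // Promise.all resolves with results in submission order,
  // regardless of which task completed first.
  return Promise.all(pending);
}

const started = Date.now();
const finished = runAll().then((results) => {
  console.log(results, `elapsed: ${Date.now() - started} ms`);
  return results;
});
```

The elapsed time should be close to the longest single delay rather than the sum of all three, which is the sense in which a single-threaded program can still coordinate concurrent work.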

Cloud browsers 

An interesting new twist in the development of web 
computing architectures is the emergence of the "cloud 
browser" [63]. In these systems, a mobile browser behaves 
as a thin client when a webpage's scripts demand
heavy computation. Cloud browsers therefore demon- 
strate browser scaling in the vertical direction, whereas 
QM demonstrates browser scaling in the horizontal 
direction. Because QM makes no assumptions about its 
volunteers' underlying resources, cloud browsers can vol- 
unteer for QM alongside ordinary browsers without loss 
of generality. In other words, cloud browsers repre- 
sent enhancements of present-day browsers, while QM 
presents a solution for HPC that advances the under- 
lying architecture of the Web towards that of a Global 
Computer [64,65]. 

Biomedical applications 

In clinical environments, it can be difficult or even impos- 
sible to distribute workflows due to privacy concerns that 
prevent sensitive data from leaving the hospital environment,
where conventional HPC is typically absent. QM
addresses this concern without requiring additional
resources. As shown in Figure 5, the median computational
power of the Top500 systems in November 2013
(http://goo.gl/XIUIDP) was roughly 2,600 times that of
our lab's standard-issue desktop machine. This is a
much smaller factor than the number of machines in a 
typical medical center. Thus, even if restricted to a single 
hospital environment, volunteer computing can still rival 
the total capacity of very substantial HPC resources. 
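
The comparison above is simple division, and it can be written out as a back-of-the-envelope calculation. Only the factor of 2,600 comes from the text; the fleet size and availability fraction below are hypothetical inputs chosen for illustration, and the estimate ignores communication overhead and any gap between JS and native benchmark performance.

```javascript
// From the text: the median Top500 system (November 2013) was roughly
// 2,600 times as fast as one standard-issue desktop machine.
const MEDIAN_TOP500_IN_DESKTOPS = 2600;

// Naive upper-bound estimate of a volunteer fleet's capacity, expressed
// in multiples of the median Top500 system.
// `fleetSize` and `availability` are hypothetical inputs, not measured values.
function medianTop500Equivalents(fleetSize, availability) {
  if (availability < 0 || availability > 1) {
    throw new RangeError('availability must be between 0 and 1');
  }
  return (fleetSize * availability) / MEDIAN_TOP500_IN_DESKTOPS;
}

// Example with made-up numbers: 5,200 desktops, half of them volunteering.
console.log(medianTop500Equivalents(5200, 0.5)); // 1
```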

QM can also be used to power workflows inside of a sin- 
gle workstation. In such a scenario, the workstation would 
run QM's API server locally and use multiple browser tabs 
to execute the workflow in parallel. Such a workflow might 
also incorporate existing bioinformatics tools such as the 
Basic Local Alignment Search Tool (BLAST) [66] by using 
traditional server-side scripting languages like Perl [67] or 
Python [68] to connect to QM's API or even directly to the 
persistent storage layer. 

Security 

The security of workflows that use QM is handled orthog- 
onally to QM by the selection of volunteers and by access 
control to code and data. A number of considerations 
should nevertheless be made to assist in the configura- 
tion of its distributed operation. It is important to recall 
that the web browser executes JS within a sandboxed 



environment, which, among other protections, prevents 
programmatic access to the filesystem of the volunteer 
machine. As a result, QM's security is configured around 
two firewalls. 

The first and most basic protection is associated with 
the uniqueness of the "box" (token) issued by the submit- 
ter, which should be shared only with trusted volunteers. 
An additional layer of security can be added through the 
use of open authentication such as OAuth 2.0 [69] to ver- 
ify that only trusted volunteers are involved. This second 
layer of protection is particularly useful in creating audit 
trails. These two mechanisms can be combined in many 
ways, as appropriate for a particular workflow. For exam- 
ple, different steps of a workflow could be assigned to 
distinct cohorts of volunteers depending on the sensitiv- 
ity of the code and data and/or the trustworthiness of the 
volunteers. The resulting granularity could also be used to 
build redundancy - and therefore robustness - into the 
distributed QM operation. 
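
One way to picture the combination of the two mechanisms is sketched below for Node.js. This is an assumption-laden illustration, not QM's implementation: `newBoxToken`, `planWorkflow`, and `mayVolunteer` are hypothetical helpers, and the volunteer identifiers are assumed to have been verified already by an external provider such as OAuth 2.0.

```javascript
const crypto = require('crypto');

// Issue an unguessable box token: 128 random bits, hex-encoded.
function newBoxToken() {
  return crypto.randomBytes(16).toString('hex');
}

// Hypothetical policy: each workflow step gets its own box and its own
// cohort of trusted volunteers.
function planWorkflow(steps) {
  return steps.map((step) => ({
    name: step.name,
    box: newBoxToken(),
    allowed: new Set(step.volunteers),
  }));
}

// A volunteer may work on a step only if it presents the right box token
// AND its (already authenticated) identity is in that step's cohort.
function mayVolunteer(stepPlan, box, volunteerId) {
  return stepPlan.box === box && stepPlan.allowed.has(volunteerId);
}

// Made-up step names and volunteer identifiers for illustration.
const plan = planWorkflow([
  { name: 'fetch-public-genomes', volunteers: ['alice', 'bob', 'carol'] },
  { name: 'analyze-sensitive-data', volunteers: ['alice'] },
]);

console.log(mayVolunteer(plan[1], plan[1].box, 'alice')); // true
console.log(mayVolunteer(plan[1], plan[1].box, 'bob'));   // false
console.log(mayVolunteer(plan[1], plan[0].box, 'alice')); // false
```

Giving each step its own box keeps a leaked token from exposing the whole workflow, and the per-step cohort lists are what would feed an audit trail.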

In short, the weakest link in QM's architecture - and
where the opportunities for abuse lie - derives from the
sharing of the "box" by members of a group of volunteers.
In this regard, the key feature of QM's design is that the 
abuse can target the submitters but not the volunteers, 
because QM's operations take place within the sandbox of 
the web browser. 

Conclusions 

QMachine was developed to respond to the challenges 
of - and to capitalize on the opportunities of - bioin- 
formatics applications encountered in biomedical envi- 
ronments. For more than a decade, volunteer comput- 
ing has enticed computational biology as a scalable and 
cost-effective high-performance computing solution. QM 
essentially ports that solution to the modern compu- 
tational landscape, which is increasingly dominated by 
mobile hardware platforms and the use of the web browser 
as the universal software platform. The features of the 
modern web browser go beyond those that make it a high- 
performance computational environment with advanced 
communication layers; they also include the transforma- 
tive feature that computations run in a robust sandbox 
that prevents access to the underlying machine's poten- 
tially sensitive filesystem. QM also responds to another 
modern trend towards engaging HPC resources through 
the use of the MapReduce programming pattern, rather 
than through direct interactions with compute nodes. The 
sequence analysis application that illustrates the use of 
QM in this report offers the sort of immediate utility 
that would benefit bioinformatics applications in Medi- 
cal Genomics. It is argued, however, that QM, as an "of 
the Web" distributed computing system, may be just as 
useful in the identification of the fundamental features of 
pervasive web computing. 
Availability of supporting data

The Streptococcus pneumoniae genome data are used
directly from the publicly available online repository at
http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/, and the
relevant FASTA files have also been archived to
http://q.cgr.googlecode.com/hg/data/, a version-controlled
repository. The original data used to produce Figure 5
were taken from
http://s.top500.org/static/lists/xml/TOP500_201311_all.xml
and are archived to http://q.cgr.googlecode.com/hg/data/.

Source code 

All source code for this paper is version-controlled and
open-sourced. The primary source for QMachine's code
is located in a Git [70] repository at
https://github.com/wilkinson/qmachine. The code and data for
the illustrative examples shown in the Results section are available
in a Mercurial [71] repository at
http://q.cgr.googlecode.com/hg/. Quanah's source code repository
is available at https://github.com/wilkinson/quanah, and the USM
repository is available at
https://github.com/usm/usm.github.com.